Birdclef 2021¶

Birdcall Identification¶

Tobias Priesholm Gårdhus & Kaare Endrup Iversen

Motivation¶


Technical¶

  • Working with "complex" mixed data-sources including audio
  • "Real world" problem

Ethical¶

  • Pros: Contributing to wild-life monitoring and preservation
  • Cons: Enables automatic surveillance

Fun¶

  • Working in a field new to both of us

Data¶

8548 audio files covering 27 different species

Procedure¶

Mel Spectrograms¶

Feature Engineering¶

Noise reduction¶

Split¶

Audio Augmentation¶

Numeric features¶

BPM¶

Harmonics¶

Input data set¶

Train samples: 113568, of which 5/6 (94640) are augmented and 1/6 (18928) are original

Validation samples: 2366

Test samples: 2367
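
The sample counts above can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are ours, and the reading of the numbers as a roughly 80/10/10 split of the un-augmented snippets is inferred from the counts, not stated on the slide.

```python
# Sample counts as reported on the slide.
train_total = 113568
train_augmented = 94640
train_original = 18928
validation = 2366
test = 2367

# The augmented and original parts add up to the full training set.
assert train_augmented + train_original == train_total

# 5/6 augmented corresponds to five augmented copies per original snippet.
copies_per_original = train_augmented // train_original

# Split fractions over the un-augmented snippets (an inferred view,
# not something the slides state explicitly).
snippets = train_original + validation + test
fractions = {
    "train": train_original / snippets,
    "validation": validation / snippets,
    "test": test / snippets,
}

print(copies_per_original)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```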

ML Models - CNN¶

Classification report
              precision    recall  f1-score   support

           0       0.63      0.76      0.69        70
           1       0.74      0.68      0.71       101
           2       0.60      0.63      0.62        70
           3       0.76      0.60      0.67        53
           4       0.75      0.73      0.74        78
           5       0.74      0.68      0.71        93
           6       0.85      0.83      0.84        48
           7       0.77      0.80      0.78       127
           8       0.69      0.57      0.62        70
           9       0.56      0.50      0.53       132
          10       0.68      0.79      0.73       132
          11       0.81      0.68      0.74        69
          12       0.80      0.86      0.83       139
          13       0.67      0.68      0.68       149
          14       0.78      0.70      0.74        83
          15       0.71      0.66      0.68        82
          16       0.67      0.50      0.57        66
          17       0.66      0.83      0.74       105
          18       0.60      0.74      0.66        77
          19       0.64      0.58      0.61        71
          20       0.67      0.77      0.72        57
          21       0.74      0.79      0.76        76
          22       0.53      0.60      0.56       112
          23       0.74      0.66      0.70       118
          24       0.59      0.53      0.56        55
          25       0.59      0.60      0.59        67
          26       0.76      0.72      0.74        67

    accuracy                           0.69      2367
   macro avg       0.69      0.68      0.69      2367
weighted avg       0.69      0.69      0.69      2367
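
As a reading aid for the summary rows, the sketch below recomputes the macro and weighted F1 averages from the per-class rows of the CNN report. The function names are ours. Because the per-class values are already rounded to two decimals, the result only agrees with the report after rounding.

```python
# (f1-score, support) per class, copied from the CNN report above.
cnn_rows = [
    (0.69, 70), (0.71, 101), (0.62, 70), (0.67, 53), (0.74, 78),
    (0.71, 93), (0.84, 48), (0.78, 127), (0.62, 70), (0.53, 132),
    (0.73, 132), (0.74, 69), (0.83, 139), (0.68, 149), (0.74, 83),
    (0.68, 82), (0.57, 66), (0.74, 105), (0.66, 77), (0.61, 71),
    (0.72, 57), (0.76, 76), (0.56, 112), (0.70, 118), (0.56, 55),
    (0.59, 67), (0.74, 67),
]


def macro_average(rows):
    """Unweighted mean of the per-class scores: every class counts equally."""
    return sum(score for score, _ in rows) / len(rows)


def weighted_average(rows):
    """Mean of the per-class scores, weighted by the number of test samples."""
    total = sum(support for _, support in rows)
    return sum(score * support for score, support in rows) / total


macro_f1 = macro_average(cnn_rows)
weighted_f1 = weighted_average(cnn_rows)
print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```

The two averages are close here because no class dominates the test set; they would diverge more under a strongly skewed class distribution.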
    

ML Models - MLP¶

Classification report
              precision    recall  f1-score   support

           0       0.31      0.23      0.26        70
           1       0.54      0.19      0.28       101
           2       0.24      0.14      0.18        70
           3       0.18      0.09      0.12        53
           4       0.35      0.53      0.42        78
           5       0.32      0.78      0.46        93
           6       0.54      0.31      0.39        48
           7       0.43      0.34      0.38       127
           8       0.30      0.04      0.07        70
           9       0.43      0.55      0.49       132
          10       0.36      0.81      0.50       132
          11       0.44      0.17      0.25        69
          12       0.43      0.55      0.48       139
          13       0.27      0.13      0.17       149
          14       0.36      0.17      0.23        83
          15       0.23      0.17      0.19        82
          16       0.14      0.05      0.07        66
          17       0.46      0.30      0.37       105
          18       0.10      0.05      0.07        77
          19       0.22      0.07      0.11        71
          20       0.29      0.30      0.30        57
          21       0.38      0.33      0.35        76
          22       0.24      0.24      0.24       112
          23       0.28      0.74      0.41       118
          24       0.34      0.27      0.30        55
          25       0.38      0.22      0.28        67
          26       0.30      0.46      0.36        67

    accuracy                           0.34      2367
   macro avg       0.33      0.31      0.29      2367
weighted avg       0.34      0.34      0.31      2367
    

ML Models - LSTM¶

Classification report
              precision    recall  f1-score   support

           0       0.48      0.29      0.36        70
           1       0.27      0.28      0.27       101
           2       0.46      0.17      0.25        70
           3       0.26      0.17      0.20        53
           4       0.34      0.54      0.42        78
           5       0.51      0.32      0.39        93
           6       0.64      0.73      0.68        48
           7       0.62      0.83      0.71       127
           8       0.54      0.27      0.36        70
           9       0.22      0.30      0.25       132
          10       0.35      0.35      0.35       132
          11       0.33      0.39      0.36        69
          12       0.40      0.56      0.47       139
          13       0.44      0.39      0.41       149
          14       0.65      0.57      0.61        83
          15       0.34      0.33      0.34        82
          16       0.00      0.00      0.00        66
          17       0.33      0.45      0.38       105
          18       0.20      0.16      0.18        77
          19       0.29      0.28      0.28        71
          20       0.53      0.60      0.56        57
          21       0.28      0.37      0.32        76
          22       0.24      0.33      0.28       112
          23       0.62      0.53      0.57       118
          24       0.48      0.22      0.30        55
          25       0.37      0.42      0.39        67
          26       0.40      0.34      0.37        67

    accuracy                           0.39      2367
   macro avg       0.39      0.38      0.37      2367
weighted avg       0.39      0.39      0.38      2367
    

ML Models - Mixed¶

ML Models - CNN & MLP¶

Classification report
              precision    recall  f1-score   support

           0       0.62      0.77      0.69        70
           1       0.82      0.78      0.80       101
           2       0.64      0.69      0.66        70
           3       0.80      0.68      0.73        53
           4       0.84      0.72      0.77        78
           5       0.77      0.81      0.79        93
           6       0.88      0.88      0.88        48
           7       0.84      0.81      0.83       127
           8       0.62      0.57      0.59        70
           9       0.74      0.73      0.74       132
          10       0.76      0.86      0.81       132
          11       0.81      0.70      0.75        69
          12       0.89      0.90      0.90       139
          13       0.73      0.72      0.72       149
          14       0.81      0.76      0.78        83
          15       0.71      0.67      0.69        82
          16       0.67      0.53      0.59        66
          17       0.73      0.78      0.76       105
          18       0.66      0.68      0.67        77
          19       0.72      0.66      0.69        71
          20       0.75      0.82      0.78        57
          21       0.71      0.80      0.75        76
          22       0.56      0.71      0.62       112
          23       0.75      0.67      0.71       118
          24       0.73      0.60      0.66        55
          25       0.66      0.67      0.67        67
          26       0.71      0.75      0.73        67

    accuracy                           0.74      2367
   macro avg       0.74      0.73      0.73      2367
weighted avg       0.74      0.74      0.74      2367
    

ML Models - CNN & LSTM¶

Classification report
              precision    recall  f1-score   support

           0       0.66      0.80      0.72        70
           1       0.75      0.70      0.72       101
           2       0.63      0.64      0.64        70
           3       0.62      0.57      0.59        53
           4       0.79      0.76      0.77        78
           5       0.81      0.69      0.74        93
           6       0.86      0.90      0.88        48
           7       0.75      0.78      0.76       127
           8       0.64      0.59      0.61        70
           9       0.54      0.51      0.52       132
          10       0.75      0.79      0.77       132
          11       0.75      0.71      0.73        69
          12       0.84      0.83      0.84       139
          13       0.61      0.64      0.62       149
          14       0.72      0.70      0.71        83
          15       0.65      0.67      0.66        82
          16       0.53      0.45      0.49        66
          17       0.75      0.88      0.81       105
          18       0.56      0.66      0.61        77
          19       0.72      0.66      0.69        71
          20       0.67      0.81      0.73        57
          21       0.77      0.70      0.73        76
          22       0.48      0.60      0.53       112
          23       0.76      0.63      0.69       118
          24       0.69      0.53      0.60        55
          25       0.69      0.60      0.64        67
          26       0.68      0.72      0.70        67

    accuracy                           0.69      2367
   macro avg       0.69      0.68      0.69      2367
weighted avg       0.69      0.69      0.69      2367
    

ML Models - Macro-average ROC curve of multi-class predictors¶

ML Models - Transfer Learning¶

ML Models - GAN¶

Convolutional GAN

WaveGAN

Summing up¶

Further challenges¶

  • Testing on the soundscape dataset

Methods for improvement¶

  • Vary length of audio snippets and resolution parameters
  • Use different kinds of spectrograms
  • Use more numeric data
  • Use a tree based algorithm on the numeric data
  • Remove audio snippets without birdcalls from training data
  • Distinguish between types of birdcalls
  • Adjust for non-uniform class distribution
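
One possible way to adjust for the non-uniform class distribution is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, n_samples / (n_classes * class_count). It is an illustration and not something we tried: the counts are the test-set supports from the classification reports, used as a stand-in because the per-class training counts are not listed in these slides.

```python
# Per-class sample counts. These are the test-set supports from the reports,
# used here only as a stand-in for the (unlisted) training counts.
support = [70, 101, 70, 53, 78, 93, 48, 127, 70, 132, 132, 69, 139, 149,
           83, 82, 66, 105, 77, 71, 57, 76, 112, 118, 55, 67, 67]


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in enumerate(counts)}


class_weight = balanced_class_weights(support)

# Rare classes get weights above 1, common classes below 1.
for label in (6, 13):
    print(label, support[label], round(class_weight[label], 2))
```

A dictionary of this shape, mapping class index to weight, is the form Keras accepts for the `class_weight` argument of `model.fit`.
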

If we had more time...¶

  • Automate hyperparameter optimization
  • Try unsupervised clustering
  • Score by top 3 appearances
  • Try self-supervised wav2vec
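
Scoring by top 3 appearances could look like the sketch below: a prediction counts as correct when the true class is among the three highest-scoring classes. The function name and the toy probabilities are hypothetical and not taken from our models.

```python
def top_k_accuracy(probabilities, labels, k=3):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(probabilities, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 4 classes (made-up numbers).
toy_probabilities = [
    [0.10, 0.60, 0.20, 0.10],  # true class 2: second place
    [0.70, 0.10, 0.10, 0.10],  # true class 0: first place
    [0.40, 0.30, 0.25, 0.05],  # true class 3: last place
]
toy_labels = [2, 0, 3]

top1 = top_k_accuracy(toy_probabilities, toy_labels, k=1)
top3 = top_k_accuracy(toy_probabilities, toy_labels, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```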

References¶

BirdCLEF 2021 https://www.kaggle.com/c/birdclef-2021/overview

WaveGAN https://arxiv.org/abs/1802.04208v3

ResNet https://github.com/KaimingHe/deep-residual-networks

DenseNet https://arxiv.org/abs/1608.06993v5

Bird Vocalizations https://en.wikipedia.org/wiki/Bird_vocalization

Very good talk about audio classification https://www.youtube.com/watch?v=uCGROOUO_wY

Good talk specifically about this subject https://www.youtube.com/watch?v=pzmdOETnhI0

Appendix¶

Packages¶

For audio:

  • Librosa (loading audio, generating spectrograms)
  • noisereduce (reducing noise)
  • audiomentations (augmenting audio)
  • PIL (handling spectrogram images)

Machine Learning:

  • tensorflow v2 (building models)
  • scikit-learn (evaluating models)

Data:

  • numpy
  • pandas
  • scipy

Visualization:

  • matplotlib
  • seaborn
  • plotly

Utility:

  • joblib
  • tqdm

Optimization¶

Multi-processing

Loading all the audio files, preprocessing them, and saving the final mel spectrograms takes a lot of time.

To speed things up, we used `joblib` to run the procedure in multiple processes. We chose this framework because it is also used internally in the Librosa package, and as such it gave us the best results.

Using 6 cores as simultaneous workers, we reduced the run time by a factor of 3, from 12 hours to 4 hours.
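
To put the multi-processing numbers in perspective, the sketch below computes the speedup and the parallel efficiency. It also gives a rough estimate of the parallelizable share of the work under Amdahl's law. Applying Amdahl's law here is our assumption; we did not measure the serial part (for example disk I/O) separately.

```python
def speedup(serial_hours, parallel_hours):
    """How many times faster the parallel run is."""
    return serial_hours / parallel_hours


def parallel_efficiency(observed_speedup, workers):
    """Share of the ideal (linear) speedup that was actually reached."""
    return observed_speedup / workers


def amdahl_parallel_fraction(observed_speedup, workers):
    """Solve S = 1 / ((1 - p) + p / n) for the parallel fraction p."""
    return (1 - 1 / observed_speedup) / (1 - 1 / workers)


s = speedup(serial_hours=12, parallel_hours=4)
efficiency = parallel_efficiency(s, workers=6)
parallel_fraction = amdahl_parallel_fraction(s, workers=6)

print(f"speedup: {s:.1f}x")
print(f"efficiency: {efficiency:.0%}")
print(f"estimated parallel fraction: {parallel_fraction:.0%}")
```

Under this model, about 80% of the pipeline ran in parallel, so adding more cores would give diminishing returns.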

Hardware

In order to run our models more efficiently, we took advantage of Google Colab's GPU resources. This sped up training by a factor of 20, from 300 seconds per epoch to 15.

Additionally, we relied on Colab's large amount of RAM (25 GB) to train the mixed models (and even then it would sometimes crash).

Model summaries¶

CNN¶

Model: "model_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_22 (InputLayer)        [(None, 48, 128, 1)]      0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 46, 126, 16)       160       
_________________________________________________________________
batch_normalization_8 (Batch (None, 46, 126, 16)       64        
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 23, 63, 16)        0         
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 21, 61, 32)        4640      
_________________________________________________________________
batch_normalization_9 (Batch (None, 21, 61, 32)        128       
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 10, 30, 32)        0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 8, 28, 64)         18496     
_________________________________________________________________
batch_normalization_10 (Batc (None, 8, 28, 64)         256       
_________________________________________________________________
max_pooling2d_10 (MaxPooling (None, 4, 14, 64)         0         
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 2, 12, 128)        73856     
_________________________________________________________________
batch_normalization_11 (Batc (None, 2, 12, 128)        512       
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 1, 6, 128)         0         
_________________________________________________________________
global_average_pooling2d_2 ( (None, 128)               0         
_________________________________________________________________
dense_54 (Dense)             (None, 512)               66048     
_________________________________________________________________
dropout_29 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_55 (Dense)             (None, 512)               262656    
_________________________________________________________________
dropout_30 (Dropout)         (None, 512)               0         
_________________________________________________________________
dense_56 (Dense)             (None, 128)               65664     
_________________________________________________________________
dropout_31 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_57 (Dense)             (None, 27)                3483      
=================================================================
Total params: 495,963
Trainable params: 495,483
Non-trainable params: 480
_________________________________________________________________
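
The parameter counts in the CNN summary can be reproduced by hand. The sketch below assumes 3x3 kernels without padding. This is inferred from the output shapes (each convolution shrinks both dimensions by 2) and from the first layer's 160 parameters; the summary itself does not list the kernel size. The helper names are ours.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """One weight per kernel cell and input channel, plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def batch_norm_params(channels):
    """Gamma and beta (trainable) plus moving mean and variance (non-trainable)."""
    return 4 * channels


def dense_params(inputs, units):
    """One weight per input, plus one bias, for every unit."""
    return (inputs + 1) * units


channels = [1, 16, 32, 64, 128]       # input channel, then filters per conv block
dense_sizes = [128, 512, 512, 128, 27]  # pooled features, then dense layer widths

conv = [conv2d_params(3, 3, c_in, c_out) for c_in, c_out in zip(channels, channels[1:])]
bn = [batch_norm_params(c) for c in channels[1:]]
dense = [dense_params(n_in, n_out) for n_in, n_out in zip(dense_sizes, dense_sizes[1:])]

total = sum(conv) + sum(bn) + sum(dense)
non_trainable = sum(2 * c for c in channels[1:])  # moving mean and variance only

print(conv)   # per-layer Conv2D parameters
print(dense)  # per-layer Dense parameters
print(total, total - non_trainable, non_trainable)
```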

MLP¶

Model: "model_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_23 (InputLayer)        [(None, 15)]              0         
_________________________________________________________________
dense_58 (Dense)             (None, 64)                1024      
_________________________________________________________________
dense_59 (Dense)             (None, 64)                4160      
_________________________________________________________________
dense_60 (Dense)             (None, 64)                4160      
_________________________________________________________________
dropout_32 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_61 (Dense)             (None, 27)                1755      
=================================================================
Total params: 11,099
Trainable params: 11,099
Non-trainable params: 0
_________________________________________________________________

LSTM¶

Model: "model_21"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_24 (InputLayer)        [(None, 48, 128)]         0         
_________________________________________________________________
lstm_61 (LSTM)               (None, 48, 36)            23760     
_________________________________________________________________
lstm_62 (LSTM)               (None, 48, 32)            8832      
_________________________________________________________________
lstm_63 (LSTM)               (None, 48, 28)            6832      
_________________________________________________________________
lstm_64 (LSTM)               (None, 48, 24)            5088      
_________________________________________________________________
lstm_65 (LSTM)               (None, 48, 20)            3600      
_________________________________________________________________
lstm_66 (LSTM)               (None, 16)                2368      
_________________________________________________________________
dense_62 (Dense)             (None, 64)                1088      
_________________________________________________________________
dropout_33 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_63 (Dense)             (None, 64)                4160      
_________________________________________________________________
dropout_34 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_64 (Dense)             (None, 32)                2080      
_________________________________________________________________
dropout_35 (Dropout)         (None, 32)                0         
_________________________________________________________________
dense_65 (Dense)             (None, 27)                891       
=================================================================
Total params: 58,699
Trainable params: 58,699
Non-trainable params: 0
_________________________________________________________________
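
The LSTM parameter counts follow the standard formula for an LSTM layer: four gates, each with input weights, recurrent weights, and a bias. The sketch below checks the summary against that formula; the helper names are ours.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(inputs, units):
    """One weight per input, plus one bias, for every unit."""
    return (inputs + 1) * units


# 128 mel bands per time step, then the width of each stacked LSTM layer.
lstm_widths = [128, 36, 32, 28, 24, 20, 16]
# Output of the last LSTM, then the dense layer widths.
dense_widths = [16, 64, 64, 32, 27]

lstm = [lstm_params(n_in, n_out) for n_in, n_out in zip(lstm_widths, lstm_widths[1:])]
dense = [dense_params(n_in, n_out) for n_in, n_out in zip(dense_widths, dense_widths[1:])]
total = sum(lstm) + sum(dense)

print(lstm)
print(dense)
print(total)
```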

Merge layers¶

CNN & MLP¶

Model: "model_27"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_27 (InputLayer)           [(None, 48, 128, 1)] 0                                            
__________________________________________________________________________________________________
conv2d_16 (Conv2D)              (None, 46, 126, 16)  160         input_27[0][0]                   
__________________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 46, 126, 16)  64          conv2d_16[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_16 (MaxPooling2D) (None, 23, 63, 16)   0           batch_normalization_16[0][0]     
__________________________________________________________________________________________________
conv2d_17 (Conv2D)              (None, 21, 61, 32)   4640        max_pooling2d_16[0][0]           
__________________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 21, 61, 32)   128         conv2d_17[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_17 (MaxPooling2D) (None, 10, 30, 32)   0           batch_normalization_17[0][0]     
__________________________________________________________________________________________________
conv2d_18 (Conv2D)              (None, 8, 28, 64)    18496       max_pooling2d_17[0][0]           
__________________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 8, 28, 64)    256         conv2d_18[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_18 (MaxPooling2D) (None, 4, 14, 64)    0           batch_normalization_18[0][0]     
__________________________________________________________________________________________________
conv2d_19 (Conv2D)              (None, 2, 12, 128)   73856       max_pooling2d_18[0][0]           
__________________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 2, 12, 128)   512         conv2d_19[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_19 (MaxPooling2D) (None, 1, 6, 128)    0           batch_normalization_19[0][0]     
__________________________________________________________________________________________________
global_average_pooling2d_4 (Glo (None, 128)          0           max_pooling2d_19[0][0]           
__________________________________________________________________________________________________
dense_74 (Dense)                (None, 512)          66048       global_average_pooling2d_4[0][0] 
__________________________________________________________________________________________________
dropout_41 (Dropout)            (None, 512)          0           dense_74[0][0]                   
__________________________________________________________________________________________________
input_28 (InputLayer)           [(None, 15)]         0                                            
__________________________________________________________________________________________________
dense_75 (Dense)                (None, 512)          262656      dropout_41[0][0]                 
__________________________________________________________________________________________________
dense_77 (Dense)                (None, 64)           1024        input_28[0][0]                   
__________________________________________________________________________________________________
dropout_42 (Dropout)            (None, 512)          0           dense_75[0][0]                   
__________________________________________________________________________________________________
dense_78 (Dense)                (None, 64)           4160        dense_77[0][0]                   
__________________________________________________________________________________________________
dense_76 (Dense)                (None, 128)          65664       dropout_42[0][0]                 
__________________________________________________________________________________________________
dense_79 (Dense)                (None, 64)           4160        dense_78[0][0]                   
__________________________________________________________________________________________________
dropout_43 (Dropout)            (None, 128)          0           dense_76[0][0]                   
__________________________________________________________________________________________________
dropout_44 (Dropout)            (None, 64)           0           dense_79[0][0]                   
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 192)          0           dropout_43[0][0]                 
                                                                 dropout_44[0][0]                 
__________________________________________________________________________________________________
dense_80 (Dense)                (None, 32)           6176        concatenate_1[0][0]              
__________________________________________________________________________________________________
dropout_45 (Dropout)            (None, 32)           0           dense_80[0][0]                   
__________________________________________________________________________________________________
dense_81 (Dense)                (None, 27)           891         dropout_45[0][0]                 
==================================================================================================
Total params: 508,891
Trainable params: 508,411
Non-trainable params: 480
__________________________________________________________________________________________________

CNN & LSTM¶

Model: "model_33"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_31 (InputLayer)           [(None, 48, 128, 1)] 0                                            
__________________________________________________________________________________________________
conv2d_24 (Conv2D)              (None, 46, 126, 16)  160         input_31[0][0]                   
__________________________________________________________________________________________________
batch_normalization_24 (BatchNo (None, 46, 126, 16)  64          conv2d_24[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_24 (MaxPooling2D) (None, 23, 63, 16)   0           batch_normalization_24[0][0]     
__________________________________________________________________________________________________
conv2d_25 (Conv2D)              (None, 21, 61, 32)   4640        max_pooling2d_24[0][0]           
__________________________________________________________________________________________________
batch_normalization_25 (BatchNo (None, 21, 61, 32)   128         conv2d_25[0][0]                  
__________________________________________________________________________________________________
max_pooling2d_25 (MaxPooling2D) (None, 10, 30, 32)   0           batch_normalization_25[0][0]     
__________________________________________________________________________________________________
conv2d_26 (Conv2D)              (None, 8, 28, 64)    18496       max_pooling2d_25[0][0]           
__________________________________________________________________________________________________
input_32 (InputLayer)           [(None, 48, 128)]    0                                            
__________________________________________________________________________________________________
batch_normalization_26 (BatchNo (None, 8, 28, 64)    256         conv2d_26[0][0]                  
__________________________________________________________________________________________________
lstm_73 (LSTM)                  (None, 48, 36)       23760       input_32[0][0]                   
__________________________________________________________________________________________________
max_pooling2d_26 (MaxPooling2D) (None, 4, 14, 64)    0           batch_normalization_26[0][0]     
__________________________________________________________________________________________________
lstm_74 (LSTM)                  (None, 48, 32)       8832        lstm_73[0][0]                    
__________________________________________________________________________________________________
conv2d_27 (Conv2D)              (None, 2, 12, 128)   73856       max_pooling2d_26[0][0]           
__________________________________________________________________________________________________
lstm_75 (LSTM)                  (None, 48, 28)       6832        lstm_74[0][0]                    
__________________________________________________________________________________________________
batch_normalization_27 (BatchNo (None, 2, 12, 128)   512         conv2d_27[0][0]                  
__________________________________________________________________________________________________
lstm_76 (LSTM)                  (None, 48, 24)       5088        lstm_75[0][0]                    
__________________________________________________________________________________________________
max_pooling2d_27 (MaxPooling2D) (None, 1, 6, 128)    0           batch_normalization_27[0][0]     
__________________________________________________________________________________________________
lstm_77 (LSTM)                  (None, 48, 20)       3600        lstm_76[0][0]                    
__________________________________________________________________________________________________
global_average_pooling2d_6 (Glo (None, 128)          0           max_pooling2d_27[0][0]           
__________________________________________________________________________________________________
lstm_78 (LSTM)                  (None, 16)           2368        lstm_77[0][0]                    
__________________________________________________________________________________________________
dense_90 (Dense)                (None, 512)          66048       global_average_pooling2d_6[0][0] 
__________________________________________________________________________________________________
dense_93 (Dense)                (None, 64)           1088        lstm_78[0][0]                    
__________________________________________________________________________________________________
dropout_53 (Dropout)            (None, 512)          0           dense_90[0][0]                   
__________________________________________________________________________________________________
dropout_56 (Dropout)            (None, 64)           0           dense_93[0][0]                   
__________________________________________________________________________________________________
dense_91 (Dense)                (None, 512)          262656      dropout_53[0][0]                 
__________________________________________________________________________________________________
dense_94 (Dense)                (None, 64)           4160        dropout_56[0][0]                 
__________________________________________________________________________________________________
dropout_54 (Dropout)            (None, 512)          0           dense_91[0][0]                   
__________________________________________________________________________________________________
dropout_57 (Dropout)            (None, 64)           0           dense_94[0][0]                   
__________________________________________________________________________________________________
dense_92 (Dense)                (None, 128)          65664       dropout_54[0][0]                 
__________________________________________________________________________________________________
dense_95 (Dense)                (None, 32)           2080        dropout_57[0][0]                 
__________________________________________________________________________________________________
dropout_55 (Dropout)            (None, 128)          0           dense_92[0][0]                   
__________________________________________________________________________________________________
dropout_58 (Dropout)            (None, 32)           0           dense_95[0][0]                   
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 160)          0           dropout_55[0][0]                 
                                                                 dropout_58[0][0]                 
__________________________________________________________________________________________________
dense_97 (Dense)                (None, 64)           10304       concatenate_3[0][0]              
__________________________________________________________________________________________________
dense_98 (Dense)                (None, 27)           1755        dense_97[0][0]                   
==================================================================================================
Total params: 562,347
Trainable params: 561,867
Non-trainable params: 480
__________________________________________________________________________________________________
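
The merged models reuse the single-model branches without their 27-way output layers, concatenate the branch outputs, and add a small head on top. The sketch below checks the totals in the two merged summaries under that reading. All numbers are taken from the summaries in this appendix; the variable names are ours.

```python
def dense_params(inputs, units):
    """One weight per input, plus one bias, for every unit."""
    return (inputs + 1) * units


# Branch sizes: single-model totals minus their own output layers.
cnn_branch = 495963 - 3483   # CNN up to its 128-unit dense layer
mlp_branch = 11099 - 1755    # MLP up to its last 64-unit dense layer
lstm_branch = 58699 - 891    # LSTM up to its 32-unit dense layer

# CNN & MLP: 128 + 64 features are concatenated, then Dense(32) and Dense(27).
cnn_mlp_concat = 128 + 64
cnn_mlp_head = dense_params(cnn_mlp_concat, 32) + dense_params(32, 27)
cnn_mlp_total = cnn_branch + mlp_branch + cnn_mlp_head

# CNN & LSTM: 128 + 32 features are concatenated, then Dense(64) and Dense(27).
cnn_lstm_concat = 128 + 32
cnn_lstm_head = dense_params(cnn_lstm_concat, 64) + dense_params(64, 27)
cnn_lstm_total = cnn_branch + lstm_branch + cnn_lstm_head

print(cnn_mlp_concat, cnn_mlp_total)
print(cnn_lstm_concat, cnn_lstm_total)
```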

Species legend¶

Label     Name
amerob    American Robin
barswa    Barn Swallow
bewwre    Bewick's Wren
blujay    Blue Jay
bncfly    Brown-crested Flycatcher
carwre    Carolina Wren
compau    Common Pauraque
comrav    Common Raven
comyel    Common Yellowthroat
eursta    European Starling
gbwwre1   Gray-breasted Wood-Wren
grekis    Great Kiskadee
houspa    House Sparrow
houwre    House Wren
mallar3   Mallard
norcar    Northern Cardinal
normoc    Northern Mockingbird
redcro    Red Crossbill
rewbla    Red-winged Blackbird
roahaw    Roadside Hawk
rubpep1   Rufous-browed Peppershrike
rucspa1   Rufous-collared Sparrow
sonspa    Song Sparrow
spotow    Spotted Towhee
wbwwre1   White-breasted Wood-Wren
wesmea    Western Meadowlark
yeofly1   Yellow-olive Flycatcher